Optimizing pandas - Reducing the memory footprint by 90% - updated version

Have you ever had to deal with data files that do not fit into your memory? Here is a function that trims the memory footprint for you. This post is based on https://www.dataquest.io/blog/pandas-big-data/ and updated with the automated functions from https://github.com/ianozsvald/dtype_diet/blob/master/dtype_diet.py
Published

November 10, 2020

Todo

  • Two options for automatically optimizing pandas memory usage
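
First, load the game-logs CSV from the original Dataquest post and import the two dtype_diet helpers used throughout. A minimal setup sketch; the file path is illustrative, and dtype_diet is installable via pip:

import pandas as pd
from dtype_diet import report_on_dataframe, optimize_dtypes

# Game logs data from the Dataquest post; adjust the path to your local copy
df_gamelogs = pd.read_csv('../data/game_logs.csv')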

We can check some basic info about the data with the pandas .info() method:

df_gamelogs.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB

We can see the data has 171,907 rows and 161 columns and takes 860.5 MB of memory. Let's see how much we can save with dtype_diet.

proposed_df = report_on_dataframe(df_gamelogs, unit="MB")
proposed_df
Column              Current dtype  Proposed dtype  Current Memory (MB)  Proposed Memory (MB)  Ram Usage Improvement (MB)  Ram Usage Improvement (%)
date                int64          int32           671.574219           335.818359            335.755859                  49.995347
number_of_game      int64          int8            671.574219           84.001465             587.572754                  87.491857
day_of_week         object         category        5036.400391          84.362793             4952.037598                 98.324939
v_name              object         category        5036.400391          174.776367            4861.624023                 96.529736
v_league            object         category        4952.461426          84.359375             4868.102051                 98.296617
...                 ...            ...             ...                  ...                   ...                         ...
h_player_9_id       object         category        4955.471680          412.757324            4542.714355                 91.670675
h_player_9_name     object         category        5225.463379          421.197266            4804.266113                 91.939523
h_player_9_def_pos  float64        float16         671.574219           167.940430            503.633789                  74.993020
additional_info     object         category        2714.671875          190.601074            2524.070801                 92.978854
acquisition_info    object         category        4749.209961          84.070801             4665.139160                 98.229794

161 rows × 6 columns
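
Because report_on_dataframe returns an ordinary DataFrame (one row per column, with the headers shown above), you can sort it to surface the biggest wins first. A quick sketch, assuming the report keeps the column labels from the table:

# Columns ranked by how much memory the proposed dtype would save
proposed_df.sort_values('Ram Usage Improvement (MB)', ascending=False).head(10)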

new_df = optimize_dtypes(df_gamelogs, proposed_df)  # sidesteps int64 -> int8 conversion errors on columns with NA
print(f'Original df memory: {df_gamelogs.memory_usage(deep=True).sum()/1024/1024} MB')
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
Original df memory: 860.500262260437 MB
Proposed df memory: 79.04368686676025 MB
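
That is roughly a 91% reduction, from 860.5 MB down to about 79 MB. The comment about NA values matters because plain NumPy integer dtypes cannot represent missing data, so an int64 column containing NA cannot simply be cast to int8. A small illustration with pandas' nullable integer dtype:

import pandas as pd

s = pd.Series([1, 2, None])  # the missing value forces float64 under the hood
# s.astype('int8') would raise "Cannot convert non-finite values (NA or inf) to integer"
s = s.astype('Int8')         # pandas' nullable Int8 stores the gap as <NA> instead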
# Source data (sell_prices.csv.zip): https://www.kaggle.com/c/m5-forecasting-uncertainty/
df = pd.read_csv('../data/sell_prices.csv.zip')
proposed_df = report_on_dataframe(df, unit="MB")
new_df = optimize_dtypes(df, proposed_df)  # again sidesteps integer conversion errors on columns with NA
print(f'Proposed df memory: {new_df.memory_usage(deep=True).sum()/1024/1024} MB')
Proposed df memory: 85.09655094146729 MB
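
The second option is the manual approach from the Dataquest post this one builds on: downcast numeric columns with pd.to_numeric and convert low-cardinality object columns to category. A sketch of that technique; the function name and the 50% uniqueness threshold are illustrative choices, not fixed by either source:

import pandas as pd

def shrink_dataframe(df, category_threshold=0.5):
    # Manual alternative to dtype_diet: downcast numerics, categorize strings
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Pick the smallest integer type that still fits the values
            out[col] = pd.to_numeric(out[col], downcast='integer')
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast='float')
        elif out[col].dtype == object:
            # Only categorize when there are few unique values relative to rows
            if out[col].nunique() / len(out) < category_threshold:
                out[col] = out[col].astype('category')
    return out

small_df = shrink_dataframe(df)

Note that pd.to_numeric with downcast='float' stops at float32, so this version is slightly more conservative than the float16 proposals dtype_diet made above.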